Getting Started

Load testing helps verify that your Unpod applications can handle concurrent user traffic and maintain acceptable latency as load increases.
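
The kind of concurrent-session measurement reported later on this page can be scripted with just the standard library. A minimal sketch (the session body is a placeholder; swap in a real request against your Unpod endpoint):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_load_test(session_fn, concurrency, sessions):
    """Run `sessions` invocations of session_fn across `concurrency` workers.

    Returns (success_rate, avg_latency_seconds) -- the two figures the
    concurrency tables on this page report."""
    def timed(_):
        start = time.perf_counter()
        try:
            session_fn()
            return True, time.perf_counter() - start
        except Exception:
            return False, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed, range(sessions)))

    success_rate = sum(1 for ok, _ in results if ok) / len(results)
    avg_latency = sum(t for _, t in results) / len(results)
    return success_rate, avg_latency

if __name__ == "__main__":
    # Placeholder session: a sleep stands in for a real voice-pipeline call.
    rate, avg = run_load_test(lambda: time.sleep(0.01), concurrency=5, sessions=20)
    print(f"success rate: {rate:.0%}, avg latency: {avg * 1000:.1f} ms")
```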

Performance Targets and SLAs

The platform targets 99.9% uptime with automatic failover. See the SLA documentation for full terms.

| Metric | Commitment | Credit Policy |
| --- | --- | --- |
| Uptime SLA | 99.90% available | 0.5% credit per 0.1% below |
| End-to-End Latency (p99) | < 1500 ms | Included in E2E |
| WebApp Service Latency | < 10 ms internal routing | Included in E2E |
| Vector Store Query (p99) | < 50 ms | Included in E2E |
| MongoDB Write (Fire-and-Forget) | < 40 ms | Included in E2E |
| Data Purge Verification | On-demand audit | Included (Enterprise) |
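
As a worked example of the credit-policy row above (assuming credits accrue in whole 0.1% increments, which is our reading of the table, and ignoring any cap the full SLA terms may impose):

```python
def sla_credit(measured_uptime_pct, committed_pct=99.90,
               credit_per_step_pct=0.5, step_pct=0.1):
    """Service credit (% of fees) for uptime below the 99.90% commitment."""
    shortfall = max(0.0, committed_pct - measured_uptime_pct)
    steps = int(round(shortfall / step_pct))  # whole 0.1% increments below target
    return steps * credit_per_step_pct

# 99.5% measured uptime is 0.4% below commitment -> 4 increments -> 2.0% credit
```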

Baseline Performance Metrics

The following metrics were measured under optimal baseline conditions (single session, warm cache, optimal network):

| Component | Measured p50 | Measured p95 |
| --- | --- | --- |
| Platform Orchestration | 8 ms | 12 ms |
| Speech-to-Text (STT) | 0.5 s | 0.7 s |
| LLM Inference | 0.8 s | 1.2 s |
| Text-to-Speech (TTS) | 0.3 s | 0.5 s |
| End-to-End Voice Pipeline | 1.6 s | 2.4 s |
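
p50/p95 figures like those above can be derived from raw latency samples with the standard library. A sketch (`statistics.quantiles` interpolates, so results on small samples differ slightly from streaming percentile estimators):

```python
import statistics

def latency_percentiles(samples):
    """Return (p50, p95) from a list of raw latency samples."""
    p50 = statistics.median(samples)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile
    p95 = statistics.quantiles(samples, n=100)[94]
    return p50, p95
```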

Concurrent Load Test Results

Platform stability was validated under concurrent session load:

| Test Scenario | Concurrency | Success Rate | Avg Latency |
| --- | --- | --- | --- |
| Baseline (Single Session) | 1 | 100% | 1.6 s |
| Low Concurrency | 5 | 100% | 1.65 s |
| Medium Concurrency | 10 | 100% | 1.7 s |
| High Concurrency | 15 | 100% | 1.7 s |

Infrastructure Robustness

| Capability | Status |
| --- | --- |
| Auto-scaling | Horizontal pod scaling enabled |
| Failover | Multi-region redundancy |
| Connection Pooling | Optimized for concurrent sessions |
| Rate Limiting | Per-tenant throttling |
| Observability | Real-time latency monitoring |
| Data Residency | India region available |
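
Per-tenant throttling of the kind listed above is commonly implemented as one token bucket per tenant. An illustrative sketch, not the platform's actual implementation (class name and parameters are made up):

```python
import time
from collections import defaultdict

class TenantRateLimiter:
    """One token bucket per tenant: `burst` capacity, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.burst = burst
        # tenant -> [tokens_remaining, last_refill_timestamp]
        self._buckets = defaultdict(lambda: [float(burst), time.monotonic()])

    def allow(self, tenant):
        tokens, last = self._buckets[tenant]
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst capacity.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[tenant] = [tokens - 1.0, now]
            return True
        self._buckets[tenant] = [tokens, now]
        return False
```

Each tenant draws from its own bucket, so one noisy tenant cannot exhaust another tenant's quota.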

Scalability Architecture

  • Horizontal Scaling: Native HPA (Horizontal Pod Autoscaler) for all stateless components
  • GPU Node Affinity: Dedicated GPU pools (NVIDIA A10G/L4) for inference workloads
  • Regional Infrastructure: Automatic routing through worldwide infrastructure for optimal latency
  • Database Scaling: Postgres read replicas; MongoDB ReplicaSet with automatic failover
  • SaaS Auto-scaling: Instant autoscaling, versus manual capacity planning for self-hosted deployments

Latency Optimization Techniques

  • Streaming STT/TTS: Real-time processing without full-file buffering
  • Speculative Decoding: Parallel token generation for faster LLM responses
  • Same Availability Zone: Co-located services to minimize network latency
  • gRPC/WebSocket: Low-overhead protocols for inter-service communication
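
Streaming without full-file buffering amounts to forwarding fixed-size chunks as they arrive. A generator sketch (illustrative only; the chunk size is arbitrary):

```python
def stream_chunks(source, chunk_size=3200):
    """Yield fixed-size byte chunks as data arrives, so downstream STT can
    start processing before the stream ends (no full-file buffering)."""
    buf = b""
    for data in source:
        buf += data
        while len(buf) >= chunk_size:
            yield buf[:chunk_size]
            buf = buf[chunk_size:]
    if buf:  # flush any trailing partial chunk at end of stream
        yield buf
```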

Notes

  • End-to-end latency includes external service providers (STT, LLM, TTS), which contribute to variability under load.
  • The platform orchestration layer maintains less than 15 ms latency regardless of concurrent load.
  • Performance optimizations for high-concurrency scenarios are actively being deployed.
  • Custom SLA tiers available for enterprise customers with dedicated infrastructure.